IBM HR Employee Attrition¶

Context¶

Organizations invest heavily in employee development, satisfaction, and retention. However, high attrition rates can lead to significant costs — including lost productivity, recruitment efforts, and onboarding. The ability to predict employee attrition can help HR departments take proactive steps to retain valuable talent.

This project uses IBM’s HR Analytics dataset, which contains detailed information about employees’ roles, compensation, performance, satisfaction, work environment, and more.

Objective¶

  • To identify the different factors that drive attrition
  • To build a model to predict if an employee will attrite or not

Dataset Description¶

Column descriptions:

  • Age: Age of the employee
  • Attrition: Whether the employee left the company (Yes/No)
  • BusinessTravel: Frequency of business travel
  • DailyRate: Daily salary rate
  • Department: Department the employee belongs to
  • DistanceFromHome: Distance from the employee's home to the workplace
  • Education: Education level (1–5)
  • EducationField: Field of education (e.g., Life Sciences, Marketing)
  • EnvironmentSatisfaction: Satisfaction with the work environment (1–4)
  • Gender: Employee gender
  • HourlyRate: Hourly wage
  • JobInvolvement: Level of job involvement (1–4)
  • JobLevel: Employee job level (1–5)
  • JobRole: Specific job title
  • JobSatisfaction: Satisfaction with the job (1–4)
  • MaritalStatus: Marital status
  • MonthlyIncome: Monthly salary
  • MonthlyRate: Monthly rate
  • NumCompaniesWorked: Number of companies worked for previously
  • OverTime: Whether the employee works overtime (Yes/No)
  • PercentSalaryHike: Percentage salary increase
  • PerformanceRating: Performance rating (1–4)
  • RelationshipSatisfaction: Satisfaction with relationships (1–4)
  • StockOptionLevel: Stock option level
  • TotalWorkingYears: Total years of professional experience
  • TrainingTimesLastYear: Number of training sessions attended in the last year
  • WorkLifeBalance: Work-life balance rating (1–4)
  • YearsAtCompany: Years spent at the company
  • YearsInCurrentRole: Years in the current role
  • YearsSinceLastPromotion: Years since the last promotion
  • YearsWithCurrManager: Years under the current manager

  • IBM HR Analytics Employee Attrition & Performance
In [1]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Importing Libraries¶

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# To scale the data using z-score
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

# Algorithms to use
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

from sklearn.linear_model import LogisticRegression

from sklearn.neighbors import KNeighborsClassifier

# Metrics to evaluate the model
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve

# For tuning the model
from sklearn.model_selection import GridSearchCV

# To ignore warnings
import warnings
warnings.filterwarnings("ignore")

sns.set()

Loading the dataset¶

In [3]:
# Reading the dataset
df = pd.read_csv('/content/drive/MyDrive/My DS DA/Employee Attrition/data.csv')
In [4]:
df.head()
Out[4]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 ... 1 80 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 ... 4 80 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 4 ... 2 80 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 5 ... 3 80 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 7 ... 4 80 1 6 3 3 2 2 2 2

5 rows × 35 columns

Data Overview¶

Info¶

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                  1470 non-null   int64 
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64 
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64 
 19  MonthlyRate               1470 non-null   int64 
 20  NumCompaniesWorked        1470 non-null   int64 
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64 
 24  PerformanceRating         1470 non-null   int64 
 25  RelationshipSatisfaction  1470 non-null   int64 
 26  StandardHours             1470 non-null   int64 
 27  StockOptionLevel          1470 non-null   int64 
 28  TotalWorkingYears         1470 non-null   int64 
 29  TrainingTimesLastYear     1470 non-null   int64 
 30  WorkLifeBalance           1470 non-null   int64 
 31  YearsAtCompany            1470 non-null   int64 
 32  YearsInCurrentRole        1470 non-null   int64 
 33  YearsSinceLastPromotion   1470 non-null   int64 
 34  YearsWithCurrManager      1470 non-null   int64 
dtypes: int64(26), object(9)
memory usage: 402.1+ KB
  • Total Records: 1,470 employees

  • Total Features: 35 columns

  • Missing Data: None (all columns have 1,470 non-null values)

  • Numeric (int64): 26 columns (e.g., Age, MonthlyIncome, TotalWorkingYears)

  • Categorical (object): 9 columns (e.g., Gender, Department, JobRole)

  • Target Variable: Attrition (Yes/No)

  • Identifier: EmployeeNumber (unique ID, likely not useful for modeling)

  • Encoding Needed: Categorical columns must be encoded before modeling:

    • Attrition, BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus, OverTime

Unique Values¶

In [6]:
df.nunique()
Out[6]:
0
Age 43
Attrition 2
BusinessTravel 3
DailyRate 886
Department 3
DistanceFromHome 29
Education 5
EducationField 6
EmployeeCount 1
EmployeeNumber 1470
EnvironmentSatisfaction 4
Gender 2
HourlyRate 71
JobInvolvement 4
JobLevel 5
JobRole 9
JobSatisfaction 4
MaritalStatus 3
MonthlyIncome 1349
MonthlyRate 1427
NumCompaniesWorked 10
Over18 1
OverTime 2
PercentSalaryHike 15
PerformanceRating 2
RelationshipSatisfaction 4
StandardHours 1
StockOptionLevel 4
TotalWorkingYears 40
TrainingTimesLastYear 7
WorkLifeBalance 4
YearsAtCompany 37
YearsInCurrentRole 19
YearsSinceLastPromotion 16
YearsWithCurrManager 18

  • Total Records: 1,470

  • Total Features: 35

  • No missing values in any column

  • Data Types: 26 numerical, 9 categorical

  • Attrition: Binary classification with 2 unique values (Yes/No)

  • Age: 43 unique values, likely continuous

  • Gender: 2 values (Male/Female)

  • MaritalStatus: 3 values

  • Over18: Only 1 value – not useful for modeling

  • Education: 5 levels

  • EducationField: 6 distinct education fields

  • JobRole: 9 roles

  • Department: 3 departments

  • JobLevel: 5 levels

  • Satisfaction metrics (JobSatisfaction, JobInvolvement, EnvironmentSatisfaction, RelationshipSatisfaction): 4 levels each

  • YearsAtCompany: 37 unique values

  • YearsInCurrentRole: 19

  • YearsSinceLastPromotion: 16

  • YearsWithCurrManager: 18

  • NumCompaniesWorked: 10

  • TotalWorkingYears: 40

  • TrainingTimesLastYear: 7

  • MonthlyIncome: 1,349 unique values – high variance

  • MonthlyRate: 1,427 unique values – near unique

  • HourlyRate: 71

  • DailyRate: 886

  • PercentSalaryHike: 15 values

  • StockOptionLevel: 4 levels

  • BusinessTravel: 3 levels

  • DistanceFromHome: 29 values

  • StandardHours: Only 1 value – drop candidate

  • OverTime: 2 values (Yes/No)

  • WorkLifeBalance: 4 levels

  • EmployeeCount, Over18, StandardHours: Constant – can be dropped

  • EmployeeNumber: Unique identifier – drop for modeling

In [7]:
df.columns.to_list()
Out[7]:
['Age',
 'Attrition',
 'BusinessTravel',
 'DailyRate',
 'Department',
 'DistanceFromHome',
 'Education',
 'EducationField',
 'EmployeeCount',
 'EmployeeNumber',
 'EnvironmentSatisfaction',
 'Gender',
 'HourlyRate',
 'JobInvolvement',
 'JobLevel',
 'JobRole',
 'JobSatisfaction',
 'MaritalStatus',
 'MonthlyIncome',
 'MonthlyRate',
 'NumCompaniesWorked',
 'Over18',
 'OverTime',
 'PercentSalaryHike',
 'PerformanceRating',
 'RelationshipSatisfaction',
 'StandardHours',
 'StockOptionLevel',
 'TotalWorkingYears',
 'TrainingTimesLastYear',
 'WorkLifeBalance',
 'YearsAtCompany',
 'YearsInCurrentRole',
 'YearsSinceLastPromotion',
 'YearsWithCurrManager']
  • Observations
    • Drop Columns
      • EmployeeNumber: a unique identifier for each employee
      • Over18: has only 1 unique value
      • StandardHours: has only 1 unique value
      • EmployeeCount: constant
In [8]:
df = df.drop(['EmployeeNumber', 'Over18', 'StandardHours', 'EmployeeCount'], axis = 1)
In [9]:
# Creating numerical columns
num_cols = ['Age',
 'DailyRate',
 'DistanceFromHome',
 'Education',
 'EnvironmentSatisfaction',
 'HourlyRate',
 'JobInvolvement',
 'JobLevel',
 'JobSatisfaction',
 'MonthlyIncome',
 'MonthlyRate',
 'NumCompaniesWorked',
 'PercentSalaryHike',
 'PerformanceRating',
 'RelationshipSatisfaction',
 'StockOptionLevel',
 'TotalWorkingYears',
 'TrainingTimesLastYear',
 'WorkLifeBalance',
 'YearsAtCompany',
 'YearsInCurrentRole',
 'YearsSinceLastPromotion',
 'YearsWithCurrManager']

# Creating categorical variables
cat_cols = ['Attrition',
 'BusinessTravel',
 'Department',
 'EducationField',
 'Gender',
 'JobRole',
 'MaritalStatus',
 'OverTime']
In [10]:
df.isnull().sum().sum()
Out[10]:
np.int64(0)
In [11]:
df.duplicated().sum()
Out[11]:
np.int64(0)

Exploratory Data Analysis¶

Univariate analysis of numerical columns¶

In [12]:
# Checking summary statistics
df[num_cols].describe().T
Out[12]:
count mean std min 25% 50% 75% max
Age 1470.0 36.923810 9.135373 18.0 30.0 36.0 43.00 60.0
DailyRate 1470.0 802.485714 403.509100 102.0 465.0 802.0 1157.00 1499.0
DistanceFromHome 1470.0 9.192517 8.106864 1.0 2.0 7.0 14.00 29.0
Education 1470.0 2.912925 1.024165 1.0 2.0 3.0 4.00 5.0
EnvironmentSatisfaction 1470.0 2.721769 1.093082 1.0 2.0 3.0 4.00 4.0
HourlyRate 1470.0 65.891156 20.329428 30.0 48.0 66.0 83.75 100.0
JobInvolvement 1470.0 2.729932 0.711561 1.0 2.0 3.0 3.00 4.0
JobLevel 1470.0 2.063946 1.106940 1.0 1.0 2.0 3.00 5.0
JobSatisfaction 1470.0 2.728571 1.102846 1.0 2.0 3.0 4.00 4.0
MonthlyIncome 1470.0 6502.931293 4707.956783 1009.0 2911.0 4919.0 8379.00 19999.0
MonthlyRate 1470.0 14313.103401 7117.786044 2094.0 8047.0 14235.5 20461.50 26999.0
NumCompaniesWorked 1470.0 2.693197 2.498009 0.0 1.0 2.0 4.00 9.0
PercentSalaryHike 1470.0 15.209524 3.659938 11.0 12.0 14.0 18.00 25.0
PerformanceRating 1470.0 3.153741 0.360824 3.0 3.0 3.0 3.00 4.0
RelationshipSatisfaction 1470.0 2.712245 1.081209 1.0 2.0 3.0 4.00 4.0
StockOptionLevel 1470.0 0.793878 0.852077 0.0 0.0 1.0 1.00 3.0
TotalWorkingYears 1470.0 11.279592 7.780782 0.0 6.0 10.0 15.00 40.0
TrainingTimesLastYear 1470.0 2.799320 1.289271 0.0 2.0 3.0 3.00 6.0
WorkLifeBalance 1470.0 2.761224 0.706476 1.0 2.0 3.0 3.00 4.0
YearsAtCompany 1470.0 7.008163 6.126525 0.0 3.0 5.0 9.00 40.0
YearsInCurrentRole 1470.0 4.229252 3.623137 0.0 2.0 3.0 7.00 18.0
YearsSinceLastPromotion 1470.0 2.187755 3.222430 0.0 0.0 1.0 3.00 15.0
YearsWithCurrManager 1470.0 4.123129 3.568136 0.0 2.0 3.0 7.00 17.0
  • Observations
    • All features have complete data (count = 1470).
    • MonthlyIncome and MonthlyRate are highly skewed → consider log transform.
    • PerformanceRating and StockOptionLevel show low variance → may be dropped.
    • YearsAtCompany, YearsSinceLastPromotion, and TotalWorkingYears show wide ranges → consider binning.
    • Satisfaction and involvement scores (EnvironmentSatisfaction, JobSatisfaction, etc.) are ordinal → treat accordingly.
    • NumCompaniesWorked = 0 may indicate a first job → consider encoding a first-job flag (a quick sketch of these suggested transforms follows below).
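The suggested transforms are easy to prototype. A minimal sketch, using the df and columns loaded above; the new column names and bin edges are illustrative and are not used in the rest of this notebook:

import numpy as np
import pandas as pd

eda = df.copy()

# Log-transform a right-skewed feature; log1p handles zero values safely
eda['MonthlyIncome_log'] = np.log1p(eda['MonthlyIncome'])

# Bin a wide-ranging tenure feature into ordinal buckets (illustrative edges)
eda['YearsAtCompany_bin'] = pd.cut(eda['YearsAtCompany'],
                                   bins=[-1, 2, 5, 10, 40],
                                   labels=['0-2', '3-5', '6-10', '11+'])

# Flag possible first-time employees
eda['IsFirstJob'] = (eda['NumCompaniesWorked'] == 0).astype(int)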
In [13]:
# Creating histograms
df[num_cols].hist(figsize = (14,14))
plt.tight_layout()
[Figure: histograms of the numerical columns]
  • Observations
    • Right-skewed features:

      • DistanceFromHome, MonthlyIncome, MonthlyRate, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager → consider log transform or binning.
    • Low variance features:

      • PerformanceRating (mostly 3 or 4), StockOptionLevel (mostly 0 or 1) → may offer little predictive value.
    • Balanced or near-normal distributions:

      • Age, HourlyRate, DailyRate show fairly even spread.
    • Ordinal categorical features:

      • JobLevel, Education, WorkLifeBalance, JobInvolvement, and satisfaction scores show discrete peaks → treat as ordinal, not continuous.
    • Notable clustering:

      • NumCompaniesWorked has a spike at 0 → may indicate first-time employees.
      • TrainingTimesLastYear and PercentSalaryHike are centered on specific values.
    • Potential engineered features:

      • Grouping Years... or Income-related columns into bins may enhance model performance.

Univariate analysis for categorical variables¶

In [14]:
for i in cat_cols:
  print(df[i].value_counts(normalize = True))

  print('*' * 40)
Attrition
No     0.838776
Yes    0.161224
Name: proportion, dtype: float64
****************************************
BusinessTravel
Travel_Rarely        0.709524
Travel_Frequently    0.188435
Non-Travel           0.102041
Name: proportion, dtype: float64
****************************************
Department
Research & Development    0.653741
Sales                     0.303401
Human Resources           0.042857
Name: proportion, dtype: float64
****************************************
EducationField
Life Sciences       0.412245
Medical             0.315646
Marketing           0.108163
Technical Degree    0.089796
Other               0.055782
Human Resources     0.018367
Name: proportion, dtype: float64
****************************************
Gender
Male      0.6
Female    0.4
Name: proportion, dtype: float64
****************************************
JobRole
Sales Executive              0.221769
Research Scientist           0.198639
Laboratory Technician        0.176190
Manufacturing Director       0.098639
Healthcare Representative    0.089116
Manager                      0.069388
Sales Representative         0.056463
Research Director            0.054422
Human Resources              0.035374
Name: proportion, dtype: float64
****************************************
MaritalStatus
Married     0.457823
Single      0.319728
Divorced    0.222449
Name: proportion, dtype: float64
****************************************
OverTime
No     0.717007
Yes    0.282993
Name: proportion, dtype: float64
****************************************
  • Observations
    • Attrition: Highly imbalanced (Yes = 16.1%, No = 83.9%) → requires stratification or class weighting in modeling.
    • BusinessTravel: Majority travel rarely (70.9%); frequent travelers (18.8%) may correlate with higher attrition.
    • Department: Most employees are in R&D (65.4%), with fewer in Sales (30.3%) and HR (4.3%).
    • EducationField: Life Sciences (41.2%) and Medical (31.6%) dominate; Human Resources is rare (~1.8%).
    • Gender: Male-dominated (60%); may require fairness checks in modeling.
    • JobRole: Spread out, but Sales Executives and Research Scientists are the most common; Human Resources is only 3.5%.
    • MaritalStatus: Married is the largest group (45.8%), followed by Single (32%); Divorced (22.2%) is least common.
    • OverTime: 28.3% of employees work overtime, a key factor often linked to attrition.

Bivariate and Multivariate analysis¶

In [15]:
#Crosstab
pd.crosstab(df['OverTime'], df['Attrition'], normalize = 'index')
Out[15]:
Attrition No Yes
OverTime
No 0.895636 0.104364
Yes 0.694712 0.305288
In [16]:
# How many employees do not work over time and have not left the company

df[(df['OverTime'] == 'No') & (df['Attrition'] == 'No')].shape
Out[16]:
(944, 31)
In [17]:
# Employees who do not work overtime but still left the company

df[(df['OverTime'] == 'No') & (df['Attrition'] == 'Yes')].shape
Out[17]:
(110, 31)
In [18]:
# Proportion of employees who do not work overtime, but still left the company
110/(944 + 110)
Out[18]:
0.10436432637571158
In [19]:
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(15, 10), sharey=True)

i = 0
for x in range(2):
    for y in range(4):
        if i < len(cat_cols):
            pd.crosstab(df[cat_cols[i]], df['Attrition'], normalize='index')\
              .mul(100).plot(kind='bar', stacked=True, ax=axes[x, y])
            axes[x, y].set_ylabel('Percentage Attrition %')
            axes[x, y].set_title(cat_cols[i])
            i += 1
        else:
            axes[x, y].axis('off')  # Hide empty subplot

plt.tight_layout()
plt.show()
[Figure: stacked bar charts of attrition percentage for each categorical variable]
  • Observations
    • OverTime: Strongest indicator — employees working overtime have much higher attrition.
    • BusinessTravel: Frequent travelers are more likely to leave than those who travel rarely or not at all.
    • JobRole: Sales Representatives and Laboratory Technicians show higher attrition; Managers and Directors show lower.
    • Department: Sales has higher attrition compared to R&D and HR.
    • MaritalStatus: Single employees are more likely to leave than married or divorced ones.
    • EducationField: Attrition appears fairly consistent across fields, with minor differences.
    • Gender: Slightly higher attrition among females, though the gap is small.

Relationship between attrition and numerical variables

In [20]:
# The mean of numerical variables grouped by attrition
df.groupby(['Attrition'])[num_cols].mean()
Out[20]:
Age DailyRate DistanceFromHome Education EnvironmentSatisfaction HourlyRate JobInvolvement JobLevel JobSatisfaction MonthlyIncome ... PerformanceRating RelationshipSatisfaction StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
Attrition
No 37.561233 812.504461 8.915653 2.927007 2.771290 65.952149 2.770479 2.145985 2.778589 6832.739659 ... 3.153285 2.733982 0.845093 11.862936 2.832928 2.781022 7.369019 4.484185 2.234388 4.367397
Yes 33.607595 750.362869 10.632911 2.839662 2.464135 65.573840 2.518987 1.637131 2.468354 4787.092827 ... 3.156118 2.599156 0.527426 8.244726 2.624473 2.658228 5.130802 2.902954 1.945148 2.852321

2 rows × 23 columns

  • Observations
    • Age: Employees who left are younger (33.6 vs 37.6).
    • MonthlyIncome: Those who left earned significantly less (≈ $4.8k vs $6.8k).
    • JobLevel: Leavers tend to be in lower-level positions (1.64 vs 2.15).
    • TotalWorkingYears: Lower average experience for leavers (8.2 vs 11.9).
    • YearsWithCurrManager: Much lower for leavers (2.85 vs 4.37), possibly indicating weak leadership ties.
    • YearsInCurrentRole: Leavers had shorter tenure in role (2.9 vs 4.5).
    • JobInvolvement & Satisfaction Scores: Lower across the board for leavers (e.g., JobSatisfaction: 2.47 vs 2.78).
    • EnvironmentSatisfaction: Lower among those who left (2.46 vs 2.77).
    • DistanceFromHome: Higher for leavers (10.6 vs 8.9) — long commutes may affect attrition.
    • StockOptionLevel: Higher for those who stayed (0.85 vs 0.53) — equity may help retain talent.
  • Together, these differences profile the kind of employee who leaves most often: younger, lower-paid, less tenured, and less satisfied.

Relationship between different numerical variables

In [21]:
# Plotting the correlation between numerical variables
plt.figure(figsize = (15, 8))

mask = np.triu(df[num_cols].corr())
sns.heatmap(df[num_cols].corr(), annot = True, fmt = '0.2f', cmap = 'YlGnBu', mask=mask)
Out[21]:
<Axes: >
[Figure: correlation heatmap of the numerical variables, upper triangle masked]
  • Observations
    • Strong Positive Correlations:

      • MonthlyIncome vs JobLevel (0.95)
      • YearsAtCompany vs YearsInCurrentRole (0.76)
      • YearsWithCurrManager vs YearsInCurrentRole (0.71)
      • TotalWorkingYears vs Age (0.68)
      • JobLevel vs TotalWorkingYears (0.78)
    • Moderate Positive Correlations:

      • MonthlyIncome vs TotalWorkingYears (0.62)
    • Low or Negligible Correlations:

      • DailyRate, DistanceFromHome, HourlyRate, and PerformanceRating show weak or no significant correlation with other features.
      • TrainingTimesLastYear and WorkLifeBalance have almost no correlation with other variables.
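As a quick programmatic check of the heatmap, a sketch using the num_cols list defined above (the printed pairs should match the observations):

import numpy as np

corr = df[num_cols].corr().abs()

# Keep the upper triangle only, so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

top_pairs = upper.stack().sort_values(ascending=False)
print(top_pairs.head())  # e.g., (JobLevel, MonthlyIncome) should rank first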

Summary of EDA¶

  • Observations

    • Data Description

      • The dataset contains HR data on 1470 employees with 35 features, including:
        • Target: Attrition (Yes/No)
        • Numerical: Age, MonthlyIncome, YearsAtCompany, etc.
        • Categorical: JobRole, Department, Gender, BusinessTravel, etc.
        • Covers job satisfaction, performance, compensation, and tenure metrics.
    • Data Cleaning

      • Dropped constant or ID-like columns:
        • EmployeeCount, EmployeeNumber, Over18, StandardHours
      • Verified no missing values (count = 1470 for all columns).
      • Identified and handled class imbalance: only ~16% of employees left (Attrition == Yes).
      • Separated features into:
        • Numerical (23)
        • Categorical (7)
    • Observations from EDA

      • MonthlyIncome, MonthlyRate, TotalWorkingYears, DistanceFromHome are right-skewed.
      • PerformanceRating, StockOptionLevel show low variance.
      • Age and HourlyRate are evenly distributed.
      • JobLevel, Education, and satisfaction scores are ordinal and discrete.

Model Building - Approach¶

  1. Prepare the data for modeling.
  2. Partition the data into train and test sets.
  3. Build the model on the train data.
  4. Tune the model if required.
  5. Evaluate the model on the test set.

Preparing data for modeling¶

In [22]:
# Two ways to replace values in a pandas DataFrame

#Method 1
#df['OverTime'].replace({'Yes': 1, 'No': 0})

#Method 2
np.where(df['OverTime'] == 'Yes', 1, 0)
  # This checks for each value in the OverTime column if it is equal to 'Yes'
  # If the condition is True, it assigns 1, otherwise 0 is assigned
Out[22]:
array([1, 0, 1, ..., 1, 0, 0])
In [23]:
# Creating dummy variables for categorical Variables

# Creating the list of columns for which we need to create the dummy variables
to_get_dummies_for = ['BusinessTravel', 'Department', 'Education', 'EducationField', 'EnvironmentSatisfaction', 'Gender', 'JobInvolvement',
                      'JobLevel', 'JobRole', 'MaritalStatus']

# Creating dummy variables
df = pd.get_dummies(data = df, columns = to_get_dummies_for, drop_first = True)

# Mapping overtime and attrition
dict_OverTime = {'Yes': 1, 'No': 0} # just like above (created to map categorical string values ('Yes' and 'No') into numeric values (1 and 0)
dict_attrition = {'Yes': 1, 'No': 0}

df['OverTime'] = df.OverTime.map(dict_OverTime) # maps the 'OverTime' column's values ('Yes' to 1, 'No' to 0).
df['Attrition'] = df.Attrition.map(dict_attrition)
In [24]:
# Separating the independent variables (X) and the dependent variable (Y)

# Separating the target variable and other variables
Y = df.Attrition
X = df.drop(columns = ['Attrition'])  # features (everything except the target)
In [25]:
X.shape  # full feature matrix, before the train/test split
Out[25]:
(1470, 54)

Scaling Options and Ranking¶

Below, we scale the numerical variables in the dataset so that they share a comparable range. If we don't, the model may be biased towards variables with a larger magnitude and learn less from variables with a smaller range. There are many ways to do scaling; the cell below ranks several scikit-learn scalers and transformers on this dataset, and the top-ranked option is then used for modeling. For more information on the different scaling methods, refer to the preprocessing section of the scikit-learn user guide.

In [26]:
data = df.copy()
In [27]:
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler,
    PowerTransformer, QuantileTransformer, Normalizer
)

# 📌 Load dataset (Modify path if needed)
#file_path = "your_dataset.csv"  # Update with correct file
#data = pd.read_csv(file_path)

# 📌 Select numeric columns only
numeric_features = data.select_dtypes(include=['number']).dropna()

# 📌 Detect Outliers using IQR (Interquartile Range)
def detect_outliers(df1):
    outlier_info = {}
    for column in df1.columns:
        Q1 = df1[column].quantile(0.25)
        Q3 = df1[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Count outliers
        outlier_count = ((df1[column] < lower_bound) | (df1[column] > upper_bound)).sum()
        outlier_info[column] = outlier_count

    return outlier_info

outliers = detect_outliers(numeric_features)
outlier_df1 = pd.DataFrame.from_dict(outliers, orient='index', columns=['Outlier Count'])

# 📌 Check Skewness (to determine if transformation is needed)
skewness = numeric_features.skew()

# 📌 Scaling Options with All Variants
scalers = {
    "StandardScaler": StandardScaler(),
    "MinMaxScaler": MinMaxScaler(),
    "RobustScaler": RobustScaler(),
    "MaxAbsScaler": MaxAbsScaler(),
    "PowerTransformer (Yeo-Johnson)": PowerTransformer(method="yeo-johnson"),
    "PowerTransformer (Box-Cox)": PowerTransformer(method="box-cox"),
    "QuantileTransformer (Normal)": QuantileTransformer(output_distribution="normal"),
    "QuantileTransformer (Uniform)": QuantileTransformer(output_distribution="uniform"),
    "Normalizer (L1)": Normalizer(norm="l1"),
    "Normalizer (L2)": Normalizer(norm="l2"),
    "Normalizer (Max)": Normalizer(norm="max"),
}

scaler_results = {}

# 📌 Try all scalers & transformations
for scaler_name, scaler in scalers.items():
    try:
        transformed_data = scaler.fit_transform(numeric_features)
        transformed_df1 = pd.DataFrame(transformed_data, columns=numeric_features.columns, index=numeric_features.index)

        # Calculate variance after scaling (Higher is better)
        cumulative_variance = transformed_df1.var().sum()
        scaler_results[scaler_name] = cumulative_variance

    except Exception as e:
        print(f"❌ Error with {scaler_name}: {e}")

# 📌 Convert results to DataFrame and Rank
scaler_rank_df1 = pd.DataFrame(list(scaler_results.items()), columns=['Scaler', 'Cumulative Variance'])
scaler_rank_df1 = scaler_rank_df1.sort_values(by="Cumulative Variance", ascending=False)

# 📌 Pick the best scaler based on variance ranking
best_scaler_name = scaler_rank_df1.iloc[0]["Scaler"]
best_scaler = scalers[best_scaler_name]
scaled_data = best_scaler.fit_transform(numeric_features)

# 📌 Convert Scaled Data to DataFrame
scaled_df1 = pd.DataFrame(scaled_data, columns=numeric_features.columns, index=numeric_features.index)

# 📌 Save the scaled dataset
scaled_df1.to_csv("/content/drive/MyDrive/My DS DA/Employee Attrition/scaled_dataset.csv", index=False)

# 📌 Explain Selection
explanations = {
    "MinMaxScaler": "Scales data between [0,1]. Used in Unsupervised Learning & Deep Learning but NOT recommended for outliers.",
    "StandardScaler": "Standardizes data to mean=0, variance=1. Recommended for Unsupervised Learning & Deep Learning. Works well with outliers.",
    "RobustScaler": "Uses median and IQR. Best choice for handling outliers in both Unsupervised and Supervised Learning.",
    "MaxAbsScaler": "Scales data between [-1,1]. Used for sparse data, not generally needed for typical datasets.",
    "PowerTransformer (Yeo-Johnson)": "Transforms data to be more Gaussian-like. Good for skewed data in Regression & Classification.",
    "PowerTransformer (Box-Cox)": "Similar to Yeo-Johnson but works only with strictly positive values. Reduces skewness.",
    "QuantileTransformer (Normal)": "Maps data to a normal distribution. Works well when distribution is unknown.",
    "QuantileTransformer (Uniform)": "Maps data to a uniform distribution. Useful for highly irregular distributions.",
    "Normalizer (L1)": "Normalizes each row by L1 norm. Good for text-based or sparse datasets.",
    "Normalizer (L2)": "Normalizes each row by L2 norm. Often used in clustering tasks.",
    "Normalizer (Max)": "Normalizes each row by its maximum absolute value. Good for text and sparse data."
}

# 📌 Print Summary
print("\n📊 Scaler Rankings:")
print(scaler_rank_df1)

print(f"\n🏆 Best Scaler Chosen: {best_scaler_name}")
print(f"📌 Reason: {explanations.get(best_scaler_name, 'No explanation available')}")

# 📌 Print Insights
if outlier_df1['Outlier Count'].sum() > 0:
    print("\n⚠️ Outliers detected! **RobustScaler** is recommended if handling them is critical.")
else:
    print("\n✅ No significant outliers detected. **StandardScaler** is the default recommendation.")

if any(abs(skewness) > 1):
    print("\n⚠️ Skewed data detected! **PowerTransformer** is recommended.")
❌ Error with PowerTransformer (Box-Cox): The Box-Cox transformation can only be applied to strictly positive data

📊 Scaler Rankings:
                           Scaler  Cumulative Variance
5    QuantileTransformer (Normal)           135.841283
0                  StandardScaler            21.014295
4  PowerTransformer (Yeo-Johnson)            20.013615
2                    RobustScaler            11.679481
6   QuantileTransformer (Uniform)             2.197743
1                    MinMaxScaler             1.695346
3                    MaxAbsScaler             1.284149
9                Normalizer (Max)             0.135088
8                 Normalizer (L2)             0.098395
7                 Normalizer (L1)             0.060545

🏆 Best Scaler Chosen: QuantileTransformer (Normal)
📌 Reason: Maps data to a normal distribution. Works well when distribution is unknown.

⚠️ Outliers detected! **RobustScaler** is recommended if handling them is critical.

⚠️ Skewed data detected! **PowerTransformer** is recommended.

Scaling the data¶

The independent variables in this dataset have different scales. When features have different scales from each other, there is a chance that a higher weightage will be given to features that have a higher magnitude, and they will dominate over other features whose magnitude changes may be smaller but whose percentage changes may be just as significant or even larger. This will impact the performance of our machine learning algorithm, and we do not want our algorithm to be biased towards one feature.

The solution to this issue is Feature Scaling, i.e. scaling the dataset so as to give every transformed variable a comparable scale.

In this problem, based on the ranking above, we use the QuantileTransformer with a normal output distribution, which maps each feature to a Gaussian-like distribution using its quantiles.

The StandardScaler alternative, which standardizes features by subtracting the mean and scaling to unit variance (the z-score), is left commented out in the cell below.

In [28]:
# # Scaling the data
# sc = StandardScaler()

# X_scaled = sc.fit_transform(X)

# X_scaled = pd.DataFrame(X_scaled, columns = X.columns)

from sklearn.preprocessing import QuantileTransformer

# Scaling the data
sc = QuantileTransformer(output_distribution='normal')

X_scaled = sc.fit_transform(X)

X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

Splitting the data into 70% train and 30% test sets¶

Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance, there could be several times more negative samples than positive samples. In such cases, it is recommended to use the stratified sampling technique to ensure that relative class frequencies are approximately preserved in each train and validation fold.

In [29]:
# Splitting the data
x_train, x_test, y_train, y_test = train_test_split(X_scaled, Y, test_size = 0.3, random_state = 1, stratify = Y)

stratify=Y

  • The data is split so that the proportion of each class in the target variable Y is preserved in both the training and test sets.
  • This is particularly useful when the target variable is imbalanced (i.e., there are far more instances of one class than the other), ensuring that both sets have a similar class distribution.

Check for Imbalanced Data¶

In [30]:
sns.countplot(data=df, x='Attrition', edgecolor = "black");
[Figure: count plot of the Attrition classes]
In [31]:
# Number of samples in the target variable. Original data
sum(Y == 0), sum(Y == 1)
Out[31]:
(1233, 237)
In [32]:
# Proportion of class "0"
round(1233/(1233 + 237) * 100, 2)
Out[32]:
83.88
In [33]:
# Number of samples in the target variable. Training data after split using "stratify"
sum(y_train == 0), sum(y_train == 1)
Out[33]:
(863, 166)
In [34]:
# Proportion of class "0"
round(863/(863 + 166) * 100, 2)
Out[34]:
83.87
In [35]:
# Number of samples in the target variable. Training data after split without using "stratify"

_, _, y_train_no_strat, y_test_no_strat = train_test_split(X_scaled, Y, test_size = 0.3, random_state=1)

sum(y_train_no_strat == 0), sum(y_train_no_strat == 1)
Out[35]:
(869, 160)
In [36]:
# Proportion of class "0"
round(869/(869 + 160) * 100, 2)
Out[36]:
84.45
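The same proportions can be checked more directly with value_counts (a small sketch using the variables defined above):

# Class proportions in the full data and in the stratified split
print(Y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))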

Note that train_test_split does not stratify by default; it must be requested explicitly with the stratify argument, as done above. For cross-validation on classifiers, scikit-learn does default to stratified K-fold, so relative class frequencies are approximately preserved in each fold. See the scikit-learn documentation for more information.

Model evaluation criterion¶

Model Evaluation Notes – Choosing the Right Metric

Project Goal: Predict which employees are likely to leave (attrition = "Yes") to enable early intervention and improve retention.


Recommended Evaluation Metric: Recall (for Attrition = Yes)

  • Definition: Recall = True Positives / (True Positives + False Negatives)
  • Reason: Measures how many actual leavers the model successfully identifies.
  • Why it's important:
    • Missing a potential leaver (false negative) is costly for HR and business.
    • Helps proactively retain valuable employees before they exit.

Secondary Metric: F1-Score (for Attrition = Yes)

  • Definition: F1 = 2 × (Precision × Recall) / (Precision + Recall)
  • Use when: You want to balance between:
    • Catching attrition (high recall)
    • Avoiding too many false alerts (precision)

Avoid using accuracy alone:

  • Dataset is imbalanced (Attrition = Yes is only ~16%)
  • High accuracy can be misleading if the model mostly predicts "No"

Summary:

  • Primary focus: Recall (Yes) — don't miss actual leavers
  • Secondary: F1-score (Yes) — balance between false positives and false negatives
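As a quick illustration of these formulas on toy labels (not from this dataset):

from sklearn.metrics import recall_score, precision_score, f1_score

# 10 actual leavers: the model catches 7 (TP) and misses 3 (FN);
# it also wrongly flags 5 of the 40 stayers (FP)
y_true = [1]*10 + [0]*40
y_pred = [1]*7 + [0]*3 + [1]*5 + [0]*35

print(recall_score(y_true, y_pred))     # 7 / (7 + 3) = 0.70
print(precision_score(y_true, y_pred))  # 7 / (7 + 5) ≈ 0.58
print(f1_score(y_true, y_pred))         # ≈ 0.64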

The model can make two types of wrong predictions:

  1. Predicting an employee will attrite when the employee doesn't attrite (FP)
  2. Predicting an employee will not attrite when the employee actually attrites (FN)
  • The company needs to reduce FN: each false negative is a missed chance to retain a valuable employee.
  • The company also needs to understand why people are leaving.

Which case is more important?

  • Predicting that the employee will not attrite but the employee attrites, i.e., losing out on a valuable employee or asset. This would be considered a major miss for any employee attrition predictor and is hence the more important case of wrong predictions.

How can this loss be reduced, i.e., how do we reduce false negatives?

  • The company would want Recall to be maximized: the greater the Recall, the lower the number of false negatives. Hence, the focus should be on increasing Recall, i.e., identifying the true positives (Class 1) well, so that the company can provide incentives to control the attrition rate, especially for top performers. This would help optimize the overall project cost of retaining the best talent.

  • Recall (Sensitivity/TP Rate)

    • Use Case:
      • when missing a positive instance is more critical
        • e.g., identifying fraud, diagnosing disease
      • important when you want to minimize false negatives (FN) by maximizing recall

Also, let's create a function to calculate and print the classification report and the confusion matrix so that we don't have to rewrite the same code repeatedly for each model.

Recall ranges from 0 to 1.

  • Values closer to 1 are better; for example, a model with a recall of 0.98 is preferred over one with a lower recall.
In [37]:
def metrics_score(actual, predicted):

    print(classification_report(actual, predicted))

    # Confusion matrix
    cm = confusion_matrix(actual, predicted)

    # Plot confusion matrix
    plt.figure(figsize = (7, 4))

    sns.heatmap(cm, annot = True, fmt = '.2f', xticklabels = ['Not Attrite', 'Attrite'], yticklabels = ['Not Attrite', 'Attrite'])

    plt.ylabel('Actual')
    plt.xlabel('Predicted')

    plt.show()
In [38]:
from sklearn.metrics import recall_score, precision_score, accuracy_score

def model_performance_classification(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier

    predictors: independent variables

    target: dependent variable
    """

    # Predicting using the independent variables
    pred = model.predict(predictors)

    recall = recall_score(target, pred, average = 'macro')        # To compute recall

    precision = precision_score(target, pred, average = 'macro')  # To compute precision

    acc = accuracy_score(target, pred)                            # To compute accuracy score


    # Creating a dataframe of metrics

    df_perf = pd.DataFrame(
        {
            "Precision":  precision,
            "Recall":  recall,
            "Accuracy": acc,
        },

        index = [0],
    )

    return df_perf

Building the model¶

Models

  • Decision Tree
  • Random Forest

Building a Decision Tree Model¶

In [41]:
from sklearn.tree import DecisionTreeClassifier

# Building the decision tree models
# dt = DecisionTreeClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1)

dt_no_weights = DecisionTreeClassifier(random_state = 1)

# class_weight gives more importance to the class with the lowest percentage;
# it is a useful hyperparameter for imbalanced data (see the Attrition countplot above)
dt_weights = DecisionTreeClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1)

# From the imbalance check above: label 0 (Not Attrite) = 83.88%, label 1 (Attrite) = 16.12%.
# The proportions are inverted so that the minority class (1) receives the larger weight (0.83).
# The weights could be tuned, but deriving them from the class distribution is a sensible default.

For supervised classification tasks with unbalanced/imbalanced data, common ways to optimize a model to account for class imbalance include:

  • Resampling Techniques
    • Oversampling the minority class
    • Undersampling the majority class
    • A combination of both
  • Class Weight Adjustment
    • Adjust the class_weight parameter to give more importance to the minority class (the approach used here; see the sketch after this list).
  • Data Augmentation Techniques
    • Rotation
    • Flipping
    • Scaling
    • Cropping
  • Use of Different Metrics
    • F1-Score
    • Precision-Recall Curve
    • ROC-AUC Score
    • Confusion Matrix
  • Ensemble Learning Methods
    • Random Forest
    • XGBoost Parameter scale_pos_weight
  • Threshold Tuning
  • Stratified Sampling
  • Custom Loss Functions
    • Weighted Cross-Entropy
    • Focal Loss
  • Advanced Techniques
    • Cost-Sensitive Learning
    • Two-Stage Models
    • Anomaly Detection Approaches
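For example, scikit-learn can derive balanced weights from the class distribution automatically; a minimal sketch (this notebook sets the weights manually instead):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights are n_samples / (n_classes * count(class)),
# so the minority class automatically receives the larger weight
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]),
                               y=y_train)
print(dict(zip([0, 1], weights)))

# Equivalent shortcut: DecisionTreeClassifier(class_weight='balanced', random_state=1)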
In [38]:
# # Fitting decision tree model
# dt.fit(x_train, y_train)
In [42]:
# Fitting decision tree model
dt_no_weights.fit(x_train, y_train)
Out[42]:
DecisionTreeClassifier(random_state=1)
In [43]:
# Fitting decision tree model
dt_weights.fit(x_train, y_train)
Out[43]:
DecisionTreeClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1)

Let's check the model performance of decision tree

In [44]:
# Checking performance on the training dataset
y_train_pred_dt = dt_weights.predict(x_train)

metrics_score(y_train, y_train_pred_dt)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       863
           1       1.00      1.00      1.00       166

    accuracy                           1.00      1029
   macro avg       1.00      1.00      1.00      1029
weighted avg       1.00      1.00      1.00      1029

[Figure: confusion matrix on the training set]
  • Observations
    • The decision tree model shows perfect performance on the training set with 100% precision, recall, and F1-score for both classes.
    • The confusion matrix confirms no misclassifications. This is a strong indicator of overfitting, as real-world data rarely allows for such ideal separation.
    • Evaluation on the test set is needed to assess generalizability.

macro refers to the method used to average the metric across multiple classes in a multi-class classification problem

Weighted Averaging

  • weights each class by its support (number of samples in the class)
  • Gives more importance to larger classes.

Macro Averaging

  • Calculates the metric (e.g., precision, recall) independently for each class and then takes the average of these values.
  • Useful when you want to treat all classes equally, even if some classes have fewer samples.
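A tiny illustration of the difference between the two averages, on toy labels:

from sklearn.metrics import f1_score

# Imbalanced toy labels: 9 samples of class 0, 1 sample of class 1 that the model misses
y_true = [0]*9 + [1]
y_pred = [0]*10

print(f1_score(y_true, y_pred, average='macro'))     # both classes count equally -> ~0.47
print(f1_score(y_true, y_pred, average='weighted'))  # dominated by class 0 -> ~0.85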
In [45]:
# Checking performance on the test dataset
y_test_pred_dt = dt_weights.predict(x_test)

metrics_score(y_test, y_test_pred_dt)
              precision    recall  f1-score   support

           0       0.88      0.85      0.86       370
           1       0.34      0.41      0.37        71

    accuracy                           0.78       441
   macro avg       0.61      0.63      0.62       441
weighted avg       0.79      0.78      0.78       441

[Figure: confusion matrix on the test set]

Observations – Decision Tree (Test Set)

  • On the test set, the model's performance drops significantly, especially for the minority class.
  • Recall for class 1 (Attrite) is just 0.41, and precision is 0.34, indicating a high number of false positives.
  • While overall accuracy is 78%, the model struggles with correctly identifying attrition cases. This confirms overfitting observed in the training set.

Decision Tree Model – Train vs Test Comparison (with Class Weights)

Metric                  Train Set   Test Set
Accuracy                1.00        0.78
Precision (Attrite)     1.00        0.34
Recall (Attrite)        1.00        0.41
F1-score (Attrite)      1.00        0.37
Macro Avg F1            1.00        0.62
Weighted Avg F1         1.00        0.78

Observations

  • The Decision Tree model with class weights achieves perfect performance on the training data, indicating overfitting.
  • On the test set, the recall for the minority class (Attrite) is only 0.41, suggesting that the model fails to generalize well.
  • While class weighting improves sensitivity to attrition, the drop in test set metrics reveals poor generalization, making the model unreliable in production without pruning or regularization.
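One way to rein in the overfitting is cost-complexity pruning; a hedged sketch (the alpha choice below is illustrative, not tuned):

# Compute the pruning path and refit with a mid-range alpha
path = dt_weights.cost_complexity_pruning_path(x_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # illustrative choice

dt_pruned = DecisionTreeClassifier(class_weight = {0: 0.17, 1: 0.83},
                                   ccp_alpha = alpha, random_state = 1)
dt_pruned.fit(x_train, y_train)
print(dt_pruned.get_depth(), dt_pruned.get_n_leaves())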
In [98]:
# performance_classification with training data

model_performance_classification(dt_weights, x_train, y_train)
Out[98]:
Precision Recall Accuracy
0 1.0 1.0 1.0
In [99]:
# dtree_test = model_performance_classification(dt,x_test,y_test)
dtree_test = model_performance_classification(dt_weights,x_test,y_test)
dtree_test
Out[99]:
Precision Recall Accuracy
0 0.60945 0.627198 0.77551

Let's plot the feature importance and check the most important features.

In [46]:
importances = dt_weights.feature_importances_
importances
Out[46]:
array([3.79936593e-02, 6.82162389e-02, 8.10343094e-02, 5.09446909e-02,
       4.39879277e-02, 1.01489495e-01, 6.73985155e-02, 1.13790091e-02,
       6.06921082e-02, 2.03138490e-02, 1.08217662e-15, 2.01208561e-02,
       8.23041683e-02, 3.20681971e-02, 1.67164389e-02, 1.22094076e-02,
       6.53640377e-02, 1.70946813e-02, 1.57082809e-02, 1.43828341e-02,
       1.97231699e-16, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       7.67886455e-03, 0.00000000e+00, 2.19606906e-03, 0.00000000e+00,
       1.72890644e-02, 0.00000000e+00, 1.16721790e-17, 7.71928139e-03,
       9.95701923e-03, 2.54090092e-02, 2.19240018e-02, 1.38341564e-02,
       1.26714609e-02, 1.05208205e-02, 0.00000000e+00, 1.96886075e-03,
       0.00000000e+00, 5.80786172e-03, 1.67413487e-03, 0.00000000e+00,
       3.81392623e-03, 8.66106155e-03, 0.00000000e+00, 3.62580696e-03,
       3.45506258e-03, 1.16862326e-02, 3.47014779e-04, 1.03415854e-02,
       0.00000000e+00, 0.00000000e+00])
In [47]:
# Plot the feature importance

importances = dt_weights.feature_importances_

columns = X.columns

importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

plt.figure(figsize = (13, 13))

sns.barplot(x=importance_df.Importance,y=importance_df.index, palette = 'rocket');
[Figure: bar plot of decision tree feature importances]

Feature Importance – Decision Tree Classifier (with Class Weights)

  • MonthlyIncome is the most influential feature, indicating compensation is strongly linked to attrition risk.
  • StockOptionLevel and DistanceFromHome are also major contributors; employees with fewer stock options or longer commutes are more likely to leave.
  • Compensation-related features like DailyRate, MonthlyRate, and HourlyRate appear consistently high, emphasizing financial influence.
  • OverTime and YearsAtCompany show that workload and tenure play meaningful roles in predicting attrition.
  • JobSatisfaction and Age contribute moderately, suggesting satisfaction and career stage matter.
  • Work-life balance factors such as WorkLifeBalance, TrainingTimesLastYear, and YearsSinceLastPromotion have noticeable but lower influence.
  • Categorical variables (e.g., JobRole, EducationField, Gender) appear with lower importance, meaning role or gender-specific patterns are less critical in this model.
  • Overall, financial incentives, commute, and experience-related variables dominate the decision-making in the model.

Let's try to tune the model and check if we could improve the results.

Tuning the Decision Tree Classifier using GridSearch¶

  • Hyperparameter tuning is tricky in the sense that there is no direct way to calculate how a change in the hyperparameter value will reduce the loss of your model, so we usually resort to experimentation. We'll use Grid search to perform hyperparameter tuning.
  • Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.
  • It is an exhaustive search that is performed on the specific parameter values of a model.
  • The parameters of the estimator/model used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

criterion: {"gini", "entropy"}

  • The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.

max_depth

  • The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_leaf

  • The minimum number of samples is required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
In [55]:
# from sklearn import metrics

# %%time
# # Choose the type of classifier
# dtree_tuned = DecisionTreeClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1)

# # Grid of parameters to choose from
# parameters = {'max_depth': np.arange(2, 7), # should test a wide range of values 2, 15, 50, 100
#               'criterion': ['gini', 'entropy'],
#               'min_samples_leaf': [5, 10, 20, 25]}

# # Type of scoring used to compare parameter combinations.
# # "pos_label": It allows you to specify which class label should be considered as the positive class when calculating the scoring metric.
# # By default, pos_label = 1, but you can change it based on your specific use case.
# scorer = metrics.make_scorer(recall_score, pos_label = 1)

# # Grid search object
# gridCV = GridSearchCV(dtree_tuned, parameters, scoring = scorer, cv = 10) # cross validation, GridSearchCV to tune hyperparameters

# # Fitting the grid search on the train data
# gridCV = gridCV.fit(x_train, y_train)

# # Set the classifier to the best combination of parameters
# dtree_tuned = gridCV.best_estimator_

# # Fit the best estimator to the data
# dtree_tuned.fit(x_train, y_train)
In [59]:
from sklearn import metrics
from sklearn.metrics import recall_score

# Choose the type of classifier
dtree_tuned = DecisionTreeClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1)

# Grid of parameters to choose from
parameters = {
    'max_depth': np.arange(2, 7),
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [5, 10, 20, 25]
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score, pos_label=1)

# Grid search object
gridCV = GridSearchCV(dtree_tuned, parameters, scoring=scorer, cv=10)

# Fit the grid search on the train data
gridCV = gridCV.fit(x_train, y_train)

# Set the classifier to the best combination of parameters
dtree_tuned = gridCV.best_estimator_

# Fit the best estimator to the data
dtree_tuned.fit(x_train, y_train)
Out[59]:
DecisionTreeClassifier(class_weight={0: 0.17, 1: 0.83}, max_depth=np.int64(2),
                       min_samples_leaf=5, random_state=1)
In [60]:
# Show attributes in the object
gridCV.best_estimator_.__dict__
Out[60]:
{'criterion': 'gini',
 'splitter': 'best',
 'max_depth': np.int64(2),
 'min_samples_split': 2,
 'min_samples_leaf': 5,
 'min_weight_fraction_leaf': 0.0,
 'max_features': None,
 'max_leaf_nodes': None,
 'random_state': 1,
 'min_impurity_decrease': 0.0,
 'class_weight': {0: 0.17, 1: 0.83},
 'ccp_alpha': 0.0,
 'monotonic_cst': None,
 'feature_names_in_': array(['Age', 'DailyRate', 'DistanceFromHome', 'HourlyRate',
        'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate',
        'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike',
        'PerformanceRating', 'RelationshipSatisfaction',
        'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
        'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
        'YearsSinceLastPromotion', 'YearsWithCurrManager',
        'BusinessTravel_Travel_Frequently', 'BusinessTravel_Travel_Rarely',
        'Department_Research & Development', 'Department_Sales',
        'Education_2', 'Education_3', 'Education_4', 'Education_5',
        'EducationField_Life Sciences', 'EducationField_Marketing',
        'EducationField_Medical', 'EducationField_Other',
        'EducationField_Technical Degree', 'EnvironmentSatisfaction_2',
        'EnvironmentSatisfaction_3', 'EnvironmentSatisfaction_4',
        'Gender_Male', 'JobInvolvement_2', 'JobInvolvement_3',
        'JobInvolvement_4', 'JobLevel_2', 'JobLevel_3', 'JobLevel_4',
        'JobLevel_5', 'JobRole_Human Resources',
        'JobRole_Laboratory Technician', 'JobRole_Manager',
        'JobRole_Manufacturing Director', 'JobRole_Research Director',
        'JobRole_Research Scientist', 'JobRole_Sales Executive',
        'JobRole_Sales Representative', 'MaritalStatus_Married',
        'MaritalStatus_Single'], dtype=object),
 'n_features_in_': 54,
 'n_outputs_': 1,
 'classes_': array([0, 1]),
 'n_classes_': np.int64(2),
 'max_features_': 54,
 'tree_': <sklearn.tree._tree.Tree at 0x7d97502b30c0>}
In [61]:
# Checking performance on the TRAINING DATASET
y_train_pred_dt = dtree_tuned.predict(x_train)

metrics_score(y_train, y_train_pred_dt)
              precision    recall  f1-score   support

           0       0.91      0.63      0.74       863
           1       0.25      0.66      0.37       166

    accuracy                           0.63      1029
   macro avg       0.58      0.64      0.56      1029
weighted avg       0.80      0.63      0.68      1029

[Figure: confusion matrix on the training set, tuned model]

Decision Tree (GridSearchCV Tuned) – Training Set Observations

  • The grid search selected max_depth=2, min_samples_leaf=5, and the 'gini' criterion, with the class weights favoring the minority class (Attrite).
  • Training recall for the attrition class (1) improved to 0.66, indicating better sensitivity in identifying employees at risk of leaving.
  • However, precision for class 1 dropped significantly to 0.25, reflecting a higher false positive rate.
  • The confusion matrix shows 109 true positives and 57 false negatives for attrition, but also 319 false positives, reflecting the deliberate trade-off in favor of recall for class 1.
  • Overall training accuracy dropped to 0.63, indicating the tuned model prioritizes recall over precision, which is useful for early risk flagging in retention strategies.
In [62]:
# Checking performance on the TEST SET
y_test_pred_dt = dtree_tuned.predict(x_test)

metrics_score(y_test, y_test_pred_dt)
              precision    recall  f1-score   support

           0       0.89      0.57      0.70       370
           1       0.22      0.63      0.33        71

    accuracy                           0.58       441
   macro avg       0.56      0.60      0.51       441
weighted avg       0.78      0.58      0.64       441

[Figure: confusion matrix on the test set, tuned model]

Comparison: Decision Tree (Default Weights) vs GridSearchCV-Tuned Decision Tree

Metric                  Default DT (Train)   Default DT (Test)   Tuned DT (Train)   Tuned DT (Test)
Accuracy                1.00                 0.78                0.63               0.58
Precision (Attrite)     1.00                 0.34                0.25               0.22
Recall (Attrite)        1.00                 0.41                0.66               0.63
F1-score (Attrite)      1.00                 0.37                0.37               0.33
Macro Avg F1            1.00                 0.62                0.56               0.51
Weighted Avg F1         1.00                 0.78                0.68               0.64

Observations

  • The default decision tree severely overfits the training data with perfect precision, recall, and accuracy (all 1.00), but fails to generalize well on test data.
  • Its recall on test set is 0.41, meaning it misses nearly 60% of actual attrition cases.
  • Tuned decision tree via GridSearchCV significantly improves recall on both train (0.66) and test (0.63), indicating better sensitivity to the positive class (Attrite).
  • However, this gain in recall comes at the cost of precision (0.22 test), indicating the model makes more false positive predictions.
  • The tuned model generalizes better, with a smaller performance gap between train and test sets.
  • F1-score for attrition is similar between models on test data (0.37 default vs. 0.33 tuned), but the tuned model better balances recall and avoids overfitting.

Recommendation

For HR use cases where identifying potential attrition early is more critical than avoiding false positives, the GridSearchCV-tuned decision tree is preferred due to its improved recall and generalizability. Precision can be improved further with post-modeling steps like threshold tuning or ensemble methods.
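Threshold tuning can be sketched with predict_proba and the precision_recall_curve imported earlier; the 0.60 recall floor below is an illustrative assumption:

import numpy as np
from sklearn.metrics import precision_recall_curve

# In practice, tune the threshold on a validation split rather than the test set
probs = dtree_tuned.predict_proba(x_test)[:, 1]  # predicted probability of attrition
precision, recall, thresholds = precision_recall_curve(y_test, probs)

# Among thresholds that keep recall >= 0.60, pick the one with the best precision
ok = np.where(recall[:-1] >= 0.60)[0]
best = ok[np.argmax(precision[:-1][ok])]
print(f"threshold={thresholds[best]:.3f}, precision={precision[best]:.2f}, recall={recall[best]:.2f}")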

In [64]:
from sklearn.metrics import precision_score, recall_score, accuracy_score

# TEST DATA
dtree_tuned_test = model_performance_classification(dtree_tuned,x_test,y_test)
dtree_tuned_test
Out[64]:
Precision Recall Accuracy
0 0.556216 0.603388 0.582766

Output metrics BEFORE tuning the model

In [65]:
temp = pd.concat([model_performance_classification(dt_weights, x_train, y_train),
                  model_performance_classification(dt_weights, x_test, y_test)], axis=0)

temp.index = ['Training dataset', 'Test dataset']

temp
Out[65]:
Precision Recall Accuracy
Training dataset 1.00000 1.000000 1.00000
Test dataset 0.60945 0.627198 0.77551

Output metrics AFTER tuning the model. The tuned model is no longer overfitting the training data.

In [66]:
temp = pd.concat([model_performance_classification(dtree_tuned, x_train, y_train),
                  model_performance_classification(dtree_tuned, x_test, y_test)], axis=0)

temp.index = ['Training dataset', 'Test dataset']

temp
Out[66]:
Precision Recall Accuracy
Training dataset 0.579915 0.643493 0.634597
Test dataset 0.556216 0.603388 0.582766

Let's look at the feature importance of this model and try to analyze why this is happening.

In [67]:
# Feature importance
importances = dtree_tuned.feature_importances_

# Feature names
columns = X.columns

importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

# Plot
plt.figure(figsize = (13, 13))
sns.barplot(x= importance_df.Importance, y= importance_df.index);
[Figure: feature importance bar plot, tuned Decision Tree]

Feature Importance – Tuned Decision Tree

  • The top contributing feature in the tuned decision tree model is StockOptionLevel, followed closely by OverTime, both of which have significantly higher importance scores than other features.
  • DistanceFromHome, HourlyRate, and Age are also notable contributors, though with smaller impact.
  • The high importance of StockOptionLevel and OverTime aligns with previous SHAP and coefficient analyses, reinforcing their relevance in attrition prediction.
  • MonthlyIncome, JobSatisfaction, and NumCompaniesWorked have moderate influence, consistent with HR intuition about employee satisfaction and engagement.
  • This insight helps HR teams prioritize key employee variables when designing retention strategies.

Let's plot the tree and check whether our assumptions about OverTime and income hold.

As we know, a decision tree keeps growing until its nodes are homogeneous, i.e., each contains only one class. Since the dataset has many features, visualizing the whole tree would be impractical, so we only plot the top levels of each tree (max_depth of 2 to 4 in the plots below).

  1. Decision tree of the model without class weights (unweighted), plotted for visualization and understanding
In [69]:
from sklearn import tree

features = list(X.columns)

plt.figure(figsize = (15, 10), dpi=300)

# "node_ids": Show the ID number on each node.
# "class_names": Names of each of the target classes in ascending numerical order
tree.plot_tree(dt_no_weights, max_depth = 2, feature_names = features, filled = True, fontsize = 12, node_ids = True, class_names = True)

plt.show()
[Figure: decision tree plot, unweighted model]

Decision Tree Root Node Summary

  • The root node of the decision tree splits on the feature StockOptionLevel, with the condition:
    StockOptionLevel <= -2.431 (split thresholds are on the z-scored features, hence the negative values).

  • This node contains all 1029 samples in the dataset:

    • Class 0 (Not Attrite): 863
    • Class 1 (Attrite): 166
  • The Gini impurity at the root is 0.271, reflecting the moderately imbalanced 863/166 class split.

  • The left branch (True) proceeds with employees having StockOptionLevel <= -2.431 and is further split by YearsAtCompany <= -1.118, then by Age <= -0.27 and OverTime <= 0.0.

  • The right branch (False), with higher StockOptionLevel, continues splitting on MonthlyIncome <= -0.884, and further on OverTime and YearsAtCompany.

These splits suggest that StockOptionLevel, YearsAtCompany, OverTime, and MonthlyIncome are key decision drivers in predicting attrition.
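
The quoted root impurity can be checked directly from the class counts; a quick sanity check in Python:

# Gini impurity at the root = 1 - sum of squared class proportions
p0, p1 = 863 / 1029, 166 / 1029
gini_root = 1 - (p0 ** 2 + p1 ** 2)
print(round(gini_root, 3))  # 0.271, matching the value shown on the plot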

In [70]:
# Number of training samples
x_train.shape[0]
Out[70]:
1029
  • Number of samples of each class before the split.
In [71]:
# Number of samples of label 0 (No attrition)
y_train[y_train == 0].shape[0]
Out[71]:
863
In [72]:
# Number of samples of label 1 (Attrition)
y_train[y_train == 1].shape[0]
Out[72]:
166
  • class = y[0]: the predicted (majority) class label shown on each node of the plot.
In [73]:
# Number of samples where feature "OverTime" is lower than or equal to 0.5
x_train[x_train.OverTime <= 0.5].shape[0]
Out[73]:
757
In [74]:
# Number of samples where feature "OverTime" is greater than 0.5
x_train[x_train.OverTime > 0.5].shape[0]
Out[74]:
272
In [75]:
757 + 272
Out[75]:
1029

Decision Tree of model with weights

In [76]:
features = list(X.columns)

plt.figure(figsize = (22, 12), dpi=300)

tree.plot_tree(dt_weights, max_depth = 3, feature_names = features, filled = True, fontsize = 12, node_ids = True, class_names = True)

plt.show()
[Figure: decision tree plot, weighted model]
  • Observations

    • The root node of the decision tree (Node #0) splits based on StockOptionLevel <= -2.431, which is the most important feature for initial decision-making.
    • The tree reaches a maximum depth of 3 and incorporates important splits using features such as OverTime, YearsAtCompany, MonthlyIncome, and DistanceFromHome.
    • Left branches (e.g., Node #1 and Node #2) tend to lean toward predicting attrition (class y[1]) when conditions related to lower income or fewer years at company are met.
    • Right branches (e.g., Node #196 and Node #197) reflect higher MonthlyIncome or YearsAtCompany, classifying more as non-attrition (class y[0]).
    • The model shows how compensation and overtime behavior are strongly linked to attrition likelihood.
    • Gini impurity values across nodes range from 0.139 to 0.5, indicating fairly decent class separation at deeper nodes.
    • Node #2 and #117 specifically show good splits toward predicting attrition, based on Age, JobSatisfaction, and MonthlyIncome.
    • Class weights applied (0: 0.17, 1: 0.83) effectively shift the model's learning to prioritize the minority class (Attrited), as reflected in the structure and outcomes of the decision tree.

Decision Tree of model with weights and tuned

In [77]:
features = list(X.columns)

plt.figure(figsize = (15, 10), dpi=300)

tree.plot_tree(dtree_tuned, max_depth = 4, feature_names = features, filled = True, fontsize = 12, node_ids = True, class_names = True)

plt.show()
[Figure: decision tree plot, weighted and tuned model]
  • Observations

    • The root node splits on StockOptionLevel <= -2.431 across all 1029 samples (863 not attrited, 166 attrited) with a Gini index of 0.5. Although the raw counts are imbalanced, the class weights make the weighted class distribution roughly balanced, which is why the impurity sits at its maximum (see the check after this list).
    • The left branch (True) leads to node #1, which splits on OverTime <= 0.0 and classifies toward attrition (y[1]). This branch contains 428 samples with a better Gini index of 0.469 and a stronger presence of attrition cases.
    • The right branch (False) routes to node #4, which also splits on OverTime <= 0.0, but favors the non-attrition class (y[0]), with 601 samples and a Gini of 0.448. This indicates OverTime is a strong recurring splitter.
    • Further splits in node #1 reveal:
      • Node #2 has a Gini of 0.499 and samples almost evenly split (43.86 vs 48.14), making it an uncertain prediction region.
      • Node #3 improves separation with Gini = 0.316 and shows a stronger bias toward attrition (42.33 vs 10.37).
    • On the right side (node #4), node #5 has a Gini of 0.387 and maintains clear separation toward non-attrition (69.87 vs 24.9), while node #6 is balanced (Gini = 0.5), showing weaker separation.
    • The repeated appearance of StockOptionLevel and OverTime in upper splits confirms their predictive strength, as previously validated in feature importance charts.

    Overall, the tree structure shows improved class balance and separation in the tuned model. It reflects meaningful splits aligned with key HR features and supports interpretation for business application.
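
The Gini of 0.5 reported at the root of the weighted trees, despite the same 863/166 raw counts, follows from the class weights: scikit-learn computes node impurity on weighted sample counts. A quick check of that arithmetic:

# Apply the class weights used by the model to the raw counts
w0, w1 = 0.17, 0.83
n0, n1 = 863 * w0, 166 * w1          # ~146.7 vs ~137.8 weighted samples
p0, p1 = n0 / (n0 + n1), n1 / (n0 + n1)
gini_weighted = 1 - (p0 ** 2 + p1 ** 2)
print(round(gini_weighted, 3))       # ~0.5: the weighted distribution is near-balanced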

Random Forest¶

Building the Random Forest classifier: Random Forest is a bagging algorithm whose base models are decision trees. Bootstrap samples are drawn from the training data, and a decision tree is fit on each sample to make a prediction.

The results from all the decision trees are combined and the final prediction is made using voting (for classification problems) or averaging (for regression problems).
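
For classification, scikit-learn's RandomForestClassifier actually implements this combination as soft voting: it averages the per-tree class-probability estimates and takes the argmax. A small sketch (runnable once rf_estimator has been fitted in the cell below):

import numpy as np

# Average the probability estimates of all individual trees
# (classes_ is [0, 1] here, so the argmax index equals the label)
avg_proba = np.mean([t.predict_proba(x_test.values) for t in rf_estimator.estimators_], axis=0)
manual_pred = avg_proba.argmax(axis=1)

# Agrees with the forest's own predictions
print((manual_pred == rf_estimator.predict(x_test)).all())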

In [79]:
from sklearn.ensemble import RandomForestClassifier

# Fitting the Random Forest classifier on the training data
rf_estimator = RandomForestClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1)

rf_estimator.fit(x_train, y_train)
Out[79]:
RandomForestClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1)
In [80]:
# Checking performance on the training data
y_pred_train_rf = rf_estimator.predict(x_train)

metrics_score(y_train, y_pred_train_rf)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       863
           1       1.00      1.00      1.00       166

    accuracy                           1.00      1029
   macro avg       1.00      1.00      1.00      1029
weighted avg       1.00      1.00      1.00      1029

[Figure: confusion matrix for the Random Forest, training set]
In [81]:
# Checking performance on the testing data
y_pred_test_rf = rf_estimator.predict(x_test)

metrics_score(y_test, y_pred_test_rf)
              precision    recall  f1-score   support

           0       0.85      0.99      0.92       370
           1       0.78      0.10      0.17        71

    accuracy                           0.85       441
   macro avg       0.81      0.55      0.55       441
weighted avg       0.84      0.85      0.80       441

[Figure: confusion matrix for the Random Forest, test set]

Model Comparison Summary

Model Dataset Precision (1) Recall (1) F1-Score (1) Accuracy
Decision Tree (Weighted) Train 1.00 1.00 1.00 1.00
Test 0.34 0.41 0.37 0.78
DT Tuned (GridSearchCV) Train 0.25 0.66 0.37 0.63
Test 0.22 0.63 0.33 0.58
Random Forest (Weighted) Train 1.00 1.00 1.00 1.00
Test 0.78 0.10 0.17 0.85

Observations

  • The Decision Tree (Weighted) model performs perfectly on the training set but significantly drops in performance on the test set, indicating overfitting.
  • Tuned Decision Tree (via GridSearchCV with recall scoring) slightly reduces training performance to reduce overfitting, but it still struggles on the test set with both low precision and F1-score.
  • Random Forest performs excellently on training data (again, overfit), but its test performance is poor for the minority class (Recall = 0.10), despite overall accuracy being high (due to majority class dominance).
  • The precision-recall trade-off highlights that all models have difficulty generalizing well for minority (Attrite = 1) predictions, common in imbalanced classification problems.
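
As a side note, the {0: 0.17, 1: 0.83} class weights used throughout roughly invert the class frequencies of the training split. One plausible way such weights can be derived (assuming y_train from the earlier split; not necessarily how they were chosen here):

# Weight each class by the other class's share of the training data,
# so the rare Attrite class carries roughly 5x the majority's weight
n = len(y_train)                            # 1029
share_attrite = (y_train == 1).sum() / n    # ~0.16
share_stay = (y_train == 0).sum() / n       # ~0.84
print({0: round(share_attrite, 2), 1: round(share_stay, 2)})  # ~{0: 0.16, 1: 0.84}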
In [82]:
# Checking performance on the TRAINING SET
model_performance_classification(rf_estimator, x_train, y_train)
Out[82]:
Precision Recall Accuracy
0 1.0 1.0 1.0
In [83]:
# Checking performance on the TEST SET
rf_estimator_test = model_performance_classification(rf_estimator,x_test,y_test)
rf_estimator_test
Out[83]:
Precision Recall Accuracy
0 0.814815 0.546593 0.85034

Random Forest Model Performance Observations

  • The training performance shows perfect scores (Precision = 1.0, Recall = 1.0, Accuracy = 1.0), indicating the model has completely overfit the training data.
  • On the test set, overall accuracy remains high at 85.0%, but the macro-averaged recall drops to 0.55 (macro precision is around 0.81), and recall for the attrition class itself is only 0.10.
  • This performance gap suggests the model memorizes the training data rather than generalizing to unseen examples.
  • Such low sensitivity is not sufficient for attrition detection, where missing a potential leaver is costly.

Let's check the feature importance of the Random Forest

In [84]:
importances = rf_estimator.feature_importances_

columns = X.columns

importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

plt.figure(figsize = (13, 13))

sns.barplot(x= importance_df.Importance, y=importance_df.index);
[Figure: feature importance bar plot, Random Forest]

Feature Importance from Random Forest

  • The Random Forest model identifies MonthlyIncome, MonthlyRate, Age, and DailyRate as the top predictors of employee attrition.
  • YearsAtCompany, HourlyRate, and StockOptionLevel also contribute substantially to the model's predictions.
  • Several previously observed impactful features like OverTime, JobSatisfaction, and EnvironmentSatisfaction still show importance but with lower weights compared to income-related variables.
  • This suggests the Random Forest model prioritizes compensation-related features more heavily than role or engagement indicators when making predictions.

Tuning the Random Forest classifier using GridSearch¶

n_estimators: The number of trees in the forest.

min_samples_split: The minimum number of samples required to split an internal node.

min_samples_leaf: The minimum number of samples required to be at a leaf node.

max_features {“auto”, “sqrt”, “log2”, None}: The number of features to consider when looking for the best split.

  • If “auto”, then max_features=sqrt(n_features).

  • If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).

  • If “log2”, then max_features=log2(n_features).

  • If None, then max_features=n_features.

In [85]:
%%time
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(class_weight = {0: 0.17, 1: 0.83}, random_state = 1)

# Grid of parameters to choose from
params_rf = {"n_estimators": [100, 250, 500],
             "min_samples_leaf": np.arange(1, 4, 1),
             "max_features": [0.7, 0.9, 'auto']} # max features: number of features to randomly select from data set (.7 is 70% of features to 90%)

# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)

# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, params_rf, scoring = scorer, cv = 5)

grid_obj = grid_obj.fit(x_train, y_train)

# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
CPU times: user 3min 11s, sys: 271 ms, total: 3min 11s
Wall time: 3min 34s
In [86]:
rf_estimator_tuned.fit(x_train, y_train)
Out[86]:
RandomForestClassifier(class_weight={0: 0.17, 1: 0.83}, max_features=0.7,
                       min_samples_leaf=np.int64(3), n_estimators=250,
                       random_state=1)
In [87]:
# Checking performance on the training data
y_pred_train_rf_tuned = rf_estimator_tuned.predict(x_train)

metrics_score(y_train, y_pred_train_rf_tuned)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       863
           1       0.99      1.00      1.00       166

    accuracy                           1.00      1029
   macro avg       1.00      1.00      1.00      1029
weighted avg       1.00      1.00      1.00      1029

[Figure: confusion matrix for the tuned Random Forest, training set]
In [88]:
# Checking performance on the test data
y_pred_test_rf_tuned = rf_estimator_tuned.predict(x_test)

metrics_score(y_test, y_pred_test_rf_tuned)
              precision    recall  f1-score   support

           0       0.89      0.96      0.92       370
           1       0.62      0.37      0.46        71

    accuracy                           0.86       441
   macro avg       0.75      0.66      0.69       441
weighted avg       0.84      0.86      0.85       441

[Figure: confusion matrix for the tuned Random Forest, test set]
In [89]:
# Checking performance on the TRAINING SET
model_performance_classification(rf_estimator_tuned, x_train, y_train)
Out[89]:
Precision Recall Accuracy
0 0.997006 0.999421 0.999028
In [90]:
# Checking performance on the TEST SET
rf_estimator_tuned_test = model_performance_classification(rf_estimator_tuned, x_test, y_test)
rf_estimator_tuned_test
Out[90]:
Precision Recall Accuracy
0 0.753133 0.661477 0.861678
Summary of untuned vs. tuned Random Forest performance:
Model Type Dataset Precision Recall Accuracy
RF Untuned Training 1.000 1.000 1.000
Test 0.814 0.547 0.850
RF Tuned (GridSearch) Training 0.997 0.999 0.999
Test 0.753 0.661 0.862

Observations

  • Overfitting is evident in the untuned RF model: it achieves perfect training metrics but significantly lower test recall (54.7%).
  • Tuning improved generalization slightly, especially in recall (from 0.547 to 0.661), though it still underperforms in minority class detection.
  • Accuracy remains high for both models due to class imbalance, but recall for class 1 (attrition) is more critical for business use cases.
  • The tuned model better balances precision and recall on the test set, suggesting it is a more stable option for deployment despite a small drop in precision.
In [91]:
# Plotting feature importance
importances = rf_estimator_tuned.feature_importances_

columns = X.columns

importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

plt.figure(figsize = (13, 13))

sns.barplot(x= importance_df.Importance, y= importance_df.index)
Out[91]:
<Axes: xlabel='Importance', ylabel='None'>
[Figure: feature importance bar plot, tuned Random Forest]
  • Observations

  • MonthlyIncome, StockOptionLevel, and OverTime are the top drivers of attrition prediction.

  • Compensation-related features dominate the top rankings, indicating financial and reward factors are highly influential.

  • YearsAtCompany, DailyRate, and MonthlyRate also play key roles, showing tenure and pay structure relevance.

  • Behavioral and satisfaction features (e.g., JobSatisfaction, EnvironmentSatisfaction, WorkLifeBalance) have moderate to lower impact.

  • Business travel and job role variables contribute minimally, suggesting limited influence in the tuned model.

Boosting Models¶

Let's now look at another kind of ensemble technique, known as Boosting.

Understanding Boosting in Machine Learning

The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are AdaBoost (short for Adaptive Boosting) and Gradient Boosting.
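
A minimal sketch of that sequential-correction idea, using an AdaBoost-style reweighting loop over decision stumps (a simplified illustration with an ad-hoc doubling update, not the library's exact algorithm):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Start with uniform sample weights
weights = np.ones(len(y_train)) / len(y_train)

for _ in range(2):  # two illustrative boosting rounds
    stump = DecisionTreeClassifier(max_depth=1, random_state=1)
    stump.fit(x_train, y_train, sample_weight=weights)

    # Upweight the rows this stump got wrong so the next stump focuses on them
    wrong = np.asarray(stump.predict(x_train) != y_train)
    weights[wrong] *= 2.0
    weights /= weights.sum()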

XGBoost¶

  • XGBoost stands for Extreme Gradient Boosting.
  • XGBoost is a tree-based ensemble machine learning technique that improves prediction power and performance by building on the Gradient Boosting framework and incorporating reliable approximation algorithms. It is widely used and routinely appears at the top of competition leaderboards in data science.
In [ ]:
# # Installing the xgboost library using the 'pip' command.
# !pip install xgboost
In [92]:
# Importing the AdaBoostClassifier and GradientBoostingClassifier [Boosting]
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# Importing the XGBClassifier from the xgboost library
from xgboost import XGBClassifier
In [93]:
# Adaboost Classifier
adaboost_model = AdaBoostClassifier(random_state = 1)

# Fitting the model
adaboost_model.fit(x_train, y_train)

# Model Performance on the test data
adaboost_model_perf_test = model_performance_classification(adaboost_model,x_test,y_test)

adaboost_model_perf_test
Out[93]:
Precision Recall Accuracy
0 0.808492 0.598877 0.861678
In [94]:
# Gradient Boost Classifier
gbc = GradientBoostingClassifier(random_state = 1)

# Fitting the model
gbc.fit(x_train, y_train)

# Model Performance on the test data
gbc_perf_test = model_performance_classification(gbc, x_test, y_test)

gbc_perf_test
Out[94]:
Precision Recall Accuracy
0 0.789661 0.648458 0.868481
In [95]:
# XGBoost Classifier
xgb = XGBClassifier(random_state = 1, eval_metric = 'logloss')

# Fitting the model
xgb.fit(x_train,y_train)

# Model Performance on the test data
xgb_perf_test = model_performance_classification(xgb,x_test,y_test)

xgb_perf_test
Out[95]:
Precision Recall Accuracy
0 0.786274 0.666882 0.870748
  • Observations

    • All three boosting models were tested on the same dataset to compare performance in predicting employee attrition.
    • AdaBoost Classifier achieved a test accuracy of 0.8617, with recall = 0.5989, showing good balance but lower sensitivity than others.
    • Gradient Boosting Classifier slightly outperformed AdaBoost with a test accuracy of 0.8684 and a better recall = 0.6485, making it more effective at detecting attrition cases.
    • XGBoost Classifier yielded the best recall of the three at 0.6688, with a test accuracy of 0.8707, combining robust predictive power with strong sensitivity to the positive class.
    • All boosting models demonstrate comparable precision (~0.78–0.80), but XGBoost shows superior balance between recall and accuracy.
  • Conclusion:

    • XGBoost is the most reliable of the three boosting models in this context, offering the best trade-off between recall and accuracy on the test set.
    • This makes XGBoost especially suitable for applications where identifying attrition cases (class 1) is a business priority.

Hyperparameter Tuning: Boosting¶

Hyperparameter tuning is a standard technique in machine learning for finding model parameters that improve performance. Note that as the size of the data or the parameter grid grows, the computation time of the search increases accordingly.

  • For practice purposes, we have listed below some of the important hyperparameters for each algorithm that can be tuned to improve the model performance.
  1. AdaBoost

Some important hyperparameters that can be tuned:

  • base_estimator (object, default = None): The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes. If None, the base estimator is a DecisionTreeClassifier initialized with max_depth=1.

  • n_estimators (int, default = 50): The maximum number of estimators at which boosting is terminated. In the case of a perfect fit, the learning procedure is stopped early.

  • learning_rate (float, default = 1.0): The weight applied to each classifier at each boosting iteration. A higher learning rate increases the contribution of each classifier.

For a better understanding of each parameter in the AdaBoost classifier, please refer to this source.

  2. Gradient Boosting Algorithm

Some important hyperparameters that can be tuned:

  • n_estimators: The number of boosting stages that will be performed.

  • max_depth: Limits the number of nodes in the tree. The best value depends on the interaction of the input variables.

  • min_samples_split: The minimum number of samples required to split an internal node.

  • learning_rate: How much the contribution of each tree will shrink.

  • loss: The loss function to optimize.

For a better understanding of each parameter in the Gradient Boosting classifier, please refer to this source.

  3. XGBoost Algorithm

Some important hyperparameters that can be tuned:

  • booster [default = gbtree]: Which booster to use. Can be gbtree, gblinear, or dart; gbtree and dart use tree-based models, while gblinear uses linear functions.

  • min_child_weight [default = 1]: The minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weights less than min_child_weight, the building process will give up further partitioning. The larger min_child_weight is, the more conservative the algorithm will be.

For a better understanding of each parameter in the XGBoost classifier, please refer to this source.
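
None of the three boosting models above were tuned here; as a sketch of how that could be done with the same recall-oriented scorer used for the Random Forest (the grid values below are illustrative, not prescriptive):

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, recall_score
from xgboost import XGBClassifier

# Illustrative grid over the hyperparameters discussed above
params_xgb = {"n_estimators": [100, 250],
              "max_depth": [3, 5],
              "learning_rate": [0.05, 0.1],
              "min_child_weight": [1, 3]}

xgb_tuned = GridSearchCV(XGBClassifier(random_state=1, eval_metric="logloss"),
                         params_xgb,
                         scoring=make_scorer(recall_score, pos_label=1),
                         cv=5).fit(x_train, y_train).best_estimator_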

Comparison of all the models we have built so far

In [100]:
models_test_comp_df = pd.concat([dtree_test.T,
                                 dtree_tuned_test.T,
                                 rf_estimator_test.T,
                                 rf_estimator_tuned_test.T,
                                 adaboost_model_perf_test.T,
                                 gbc_perf_test.T,
                                 xgb_perf_test.T], axis = 1)

models_test_comp_df.columns = ["Decision Tree classifier",
                               "Tuned Decision Tree classifier",
                               "Random Forest classifier",
                               "Tuned Random Forest classifier",
                               "Adaboost classifier",
                               "Gradientboost classifier",
                               "XGBoost classifier"]

print("Test performance comparison:")
Test performance comparison:
In [101]:
models_test_comp_df
Out[101]:
Decision Tree classifier Tuned Decision Tree classifier Random Forest classifier Tuned Random Forest classifier Adaboost classifier Gradientboost classifier XGBoost classifier
Precision 0.609450 0.556216 0.814815 0.753133 0.808492 0.789661 0.786274
Recall 0.627198 0.603388 0.546593 0.661477 0.598877 0.648458 0.666882
Accuracy 0.775510 0.582766 0.850340 0.861678 0.861678 0.868481 0.870748

Conclusion¶

Final Model Comparison Observations (Test Set)

  • Among all models, XGBoost Classifier delivers the highest test accuracy (0.8707) and the best recall (0.6688), making it the top performer overall.
  • Gradient Boosting Classifier follows closely with an accuracy of 0.8685 and recall of 0.6485, showing robust performance across all metrics.
  • Both AdaBoost and Tuned Random Forest models provide a strong balance, each achieving an accuracy of 0.8617, though XGBoost still edges them out in recall.
  • The untuned Random Forest classifier demonstrates high precision (0.8148) but suffers from low recall (0.5466), indicating it is overly conservative in predicting attrition.
  • Decision Tree (tuned or not) significantly underperformed in comparison, with lower accuracy and recall (e.g., tuned DT recall = 0.6034, accuracy = 0.5828).
  • In general, boosting methods outperformed basic tree models and random forests in both recall and accuracy, highlighting their effectiveness on this classification task.

Recommendation:

  • XGBoost should be the preferred model for deployment due to its superior balance of recall and accuracy, which is crucial for correctly identifying employees likely to leave.
In [39]:
notebook_path = '/content/drive/MyDrive/My DS DA/Employee Attrition/Employee Attrition Prediction.ipynb'

!jupyter nbconvert --to html "{notebook_path}"

from google.colab import files
files.download('/content/drive/MyDrive/My DS DA/Employee Attrition/Employee Attrition Prediction.html')